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Abstract 

Importance  sampling  is  a  variance  reduction  technique  for  efficient 
estimation  of  rare-event  probabilities  by  Monte  Carlo.  In  standard  im¬ 
portance  sampling  schemes,  the  system  is  simulated  using  an  a  priori 
fixed  change  of  measure  suggested  by  a  large  deviation  lower  bound 
analysis.  Recent  work,  however,  has  suggested  that  such  schemes  do 
not  work  well  in  many  situations.  In  this  paper,  we  consider  adap¬ 
tive  importance  sampling  in  the  setting  of  uniformly  recurrent  Markov 
chains.  By  “adaptive,”  we  mean  that  the  change  of  measure  depends 
on  the  history  of  the  samples.  Based  on  a  control-theoretic  approach 
to  large  deviations,  the  existence  of  asymptotically  optimal  adaptive 
schemes  is  demonstrated  in  great  generality.  In  this  framework,  the 
difference  between  a  static  change  of  measure  and  an  adaptive  change 
of  measure  amounts  to  the  difference  between  an  open-loop  control 
and  a  feed-back  control.  The  implementation  of  the  adaptive  schemes 
is  carried  out  with  the  help  of  a  limiting  Bellman  equation.  Also  pre¬ 
sented  are  numerical  examples  contrasting  the  adaptive  and  standard 
schemes. 


1  Introduction 

Among  variance  reduction  techniques  for  efficient  Monte  Carlo  simulation 
is  importance  sampling,  in  which  the  data  is  generated  using  a  probability 
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distribution  different  from  the  true  underlying  distribution.  It  can  be  es¬ 
pecially  effective  when  applied  to  the  estimation  of  expectations  that  are 
largely  determined  by  rare  events.  To  demonstrate  the  difficulty  involved  in 
simulating  rare  events  by  naive  Monte  Carlo,  we  consider  a  simple  example. 
Let  X  be  a  random  variable  taking  value  in  Rd,  and  suppose  we  are  inter¬ 
ested  in  estimating  p  =  P{X  e  A}  for  some  Borel  set  A  C  Rrf.  To  this  end, 
a  sequence  of  independent  and  identically  distributed  (iid)  copies  Xq,Xi,  . . . 
of  X  are  generated.  With  I *.  =  1  {xk&A}i  an  unbiased  estimate  for  p  based  on 
the  first  K  samples  is  just  the  sample  mean:  Qk  =  (To  +  h  +  •  •  •  +  Ik- i)/ K. 
The  relative  error  associated  with  this  estimator  is 

standard  deviation  of  Qk  \f  'P  ~  P2  1 

relative  error  = - — — - = - •  — 

mean  of  Qk  V  \JK 

Since  \fp  —  p2 Ip  — *  oo  as  p  tends  0,  a  large  sample  size  K  is  required  for  the 
estimator  Qk  to  achieve  a  reasonable  relative  error  bound.  For  example,  if 
p  =  10”8,  ten  billion  samples  are  required  to  achieve  a  relative  error  bound 
of  10%. 

The  basic  idea  of  importance  sampling  is  as  follows.  Suppose  that  X 
has  distribution  6,  and  consider  an  alternative  sampling  distribution  r.  It 
is  required  that  9  be  absolutely  continuous  with  respect  to  r,  so  that  the 
Radon-Nikodym  derivative  f(x)  =  ( d9/dr)(x )  exists.  Independent  and  iden¬ 
tically  distributed  samples  Xo,X\, . . .  with  distribution  r  are  generated.  Set 
Jfe  =  l{xfc€j4},  and  form  the  estimate 

1  K-l  _ 

Qk  =  j 7  hf{Xk) 

A  k= o 

in  lieu  of  Qk-  It  is  easy  to  check  that  Qk  is  an  unbiased  estimate  of  p,  with 
a  rate  of  convergence  determined  by 

var  [To/(X0)]  =  [  l{xeA}f(x)0(dx)  -p2. 

The  optimization  of  this  quantity  over  all  possible  r  is  inappropriate.  Indeed, 
taking  f(x)  =  p~l^{x^A}  T  is  the  conditional  distribution  of  X  given 
X  £  A),  the  variance  becomes  0,  but  this  change  of  measure  requires  the 
knowledge  of  the  unknown  parameter  p.  Instead,  one  typically  seeks  to 
minimize  over  parameterized  families  of  alternative  sampling  distributions. 

When  the  distribution  of  X  is  connected  to  a  large  deviations  problem,  a 
standard  heuristic  is  that  the  change  of  measure  used  to  prove  the  large  de¬ 
viation  lower  bound  should  be  a  good  (perhaps  nearly  optimal)  distribution 
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to  use  for  the  purposes  of  importance  sampling.  The  first  result  of  this  type 
was  given  by  Siegmund  [31].  The  basic  idea  was  subsequently  investigated 
in  many  contexts,  and  a  small  selection  of  the  literally  hundreds  of  papers  on 
the  topics  is  [1,  2,  3,  7,  8,  9,  11,  12,  15,  17,  19,  23,  24,  26,  27,  30].  Necessary 
and  sufficient  conditions  under  which  a  prescribed  scheme  is  asymptotically 
optimal  are  discussed  in  [10,  28,  29],  while  [20]  gives  a  survey  of  rare-event 
simulation. 

The  validity  of  the  heuristic,  however,  was  challenged  in  [18].  Counterex¬ 
amples  were  constructed  to  show  that,  under  some  very  common  settings, 
the  change  of  measure  suggested  by  large  deviations  leads  to  importance 
sampling  scheme  with  very  poor  properties. 

In  order  to  explain  these  counterexamples,  and  more  importantly,  to 
find  asymptotically  optimal  importance  sampling  algorithms  in  great  gener¬ 
ality,  [16]  introduces  an  adaptive  importance  sampling  scheme  and  shows  its 
asymptotic  optimality  in  the  setup  of  iid  random  variables  (Cramer’s  Theo¬ 
rem).  The  key  observation  is  that  many  changes  of  measure  are  suggested  by 
the  large  deviation  lower  bound  analysis,  and  one  must  consider  this  larger 
class  if  one  hopes  to  identify  importance  sampling  schemes  that  work  well 
in  general.  This  leads  to  the  development  of  schemes  where  the  sampling 
distribution  is  “adaptive”  in  the  sense  that  it  depends  on  the  history  of  the 
samples. 

The  present  paper  analyzes  the  estimation  of  rare-event  probabilities 
associated  with  uniformly  recurrent  Markov  chains.  More  precisely,  let 
{Yj,j  G  No}  be  a  uniformly  recurrent  Markov  chain  taking  values  in  a 
Polish  space  S ,  and  let  g  :  S  — >  be  a  bounded  measurable  function. 
Define  Sn  =  g(Yo )  +  g(Y\)  +  •  •  •  +  g(Yn_ i).  The  probability  of  interest  is 
P{Sn/n  G  A}  for  a  Borel  set  A  C  and  n  large.  An  asymptotic  optimality 
result  for  traditional  importance  sampling  is  available  in  the  one-dimensional 
case  (d  =  1),  under  the  assumption  which  implies  that  the  set  A  is  within  a 
half  interval  that  does  not  contain  the  expectation  of  g  under  the  invariant 
distribution  [9].  A  “dissection”  approach  was  introduced  for  the  high  di¬ 
mensional  case  [9].  This  approach  was  later  on  applied  to  Markov  additive 
sequences  [11],  and  was  also  implicitly  used  in  [18].  This  dissection  approach 
requires  that  one  appropriately  partition  the  set  A  into  a  finite  number  of 
subsets,  and  that  a  (possibly  different)  change  of  measure  be  applied  to  ef¬ 
ficiently  estimate  the  probability  of  each  individual  subset.  However,  there 
is  no  constructive  way  to  obtain  a  suitable  partition  in  general. 

In  this  paper  we  develop  adaptive  importance  sampling  schemes  for  uni¬ 
formly  recurrent  Markov  chains.  The  existence  of  asymptotically  optimal 
adaptive  schemes  is  demonstrated  for  arbitrary  dimension  d ,  under  very  mild 
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conditions  on  the  set  A.  It  turns  out  that  one  must  study  the  asymptotics 
of  a  small  noise  stochastic  game  in  order  to  analyze  the  optimality  of  impor¬ 
tance  sampling  schemes.  The  distinction  between  the  change  of  measures 
used  in  traditional  importance  sampling  and  adaptive  importance  sampling 
amount,  in  control  terminology,  to  the  difference  between  “open- loop”  and 
“feedback”  controls.  However,  open  loop  controls  are  not  nearly  optimal  in 
the  setting  of  stochastic  games,  except  for  very  special  cases.  For  this  rea¬ 
son,  the  traditional  importance  sampling  will  not  be  asymptotically  optimal 
in  general.  Our  analysis  indicates  that  the  adaptive  scheme  also  works  for 
estimating  functionals  (other  than  probabilities)  largely  determined  by  rare 
events. 

The  paper  is  organized  as  follows.  The  setting  of  the  problem  is  intro¬ 
duced  in  Section  2,  with  a  brief  description  of  the  large  deviations  principle 
for  uniformly  recurrent  Markov  chains.  We  also  give  the  definition  of  as¬ 
ymptotic  optimality  in  this  section.  In  Section  3  we  show  that  adaptive 
importance  sampling  schemes  designed  to  minimize  the  second  moment  are 
asymptotically  optimal.  Section  4  discusses  an  alternative  formal  PDE  ap¬ 
proach  to  the  adaptive  scheme,  and  describes  a  method  for  the  construction 
of  an  asymptotically  optimal  adaptive  scheme  that  does  not  directly  depend 
on  the  large  deviation  parameter  n.  Numerical  examples  are  presented  in 
Section  4.3.  Certain  technical  proofs  are  deferred  to  the  appendices  to  ease 
exposition. 

2  Problem  setup  and  background 

2.1  Problem  setup 

Let  Y  =  {Yj,j  G  No}  be  a  time-homogeneous  Markov  chain  taking  values 
in  a  Polish  space  S,  with  transition  probability  kernel 

P(x,  dy)  =  P  {Yj+ 1  G  dy  \  Yj  =  x}  . 

Let  g  :  S  — >  be  a  bounded  Borel-measurable  function,  and  define 

Sn  =  <?(! o)  +  g(Yi)  +  •  ■  ■  +  g(Yn_  i). 

For  an  arbitrary  Borel  set  A  C  Rd,  we  wish  to  estimate 

Pn  =  P{Sn/n  G  A}  . 

Throughout  the  paper,  we  will  make  use  of  the  following  uniform  recurrency 
assumption. 
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Condition  2.1  There  exists  a  probability  measure  vp  on  S,  an  integer  mo  G 
N,  and  a  pair  of  strictly  positive  real  numbers  a,  b  such  that 

avp(B)  <  p(mo\x,B)  <  bup(B) 

for  all  x  G  S  and  Borel  sets  B.  Here  p(rn'>  denotes  the  m-step  probability 
transition  kernel. 

For  example,  an  irreducible  Markov  chain  with  a  finite  state  space  is  always 
uniformly  recurrent. 

The  large  deviation  principle  for  a  uniformly  recurrent  Markov  chain  is 
well  known.  It  asserts  that  {Sn/n}  satisfies  the  large  deviation  principle 
with  a  convex  rate  function  L  :  — *  [0,  oo].  The  identification  of  L  is 

deferred  to  the  next  subsection.  We  will  impose  the  following  assumption 
throughout  the  paper. 

Condition  2.2  The  Borel  set  icKd  satisfies  the  condition 

inf  L(P)  =  inf  L(@). 

/3eA  ft  £A° 

Under  Conditions  2.1  and  2.2,  we  have  the  large  deviations  approximation 
lim  —  log  P{Sn/n  £d}  =  -  inf  L(/3). 

n->oo  n  fteA 

2.2  LDP  for  a  uniformly  recurrent  Markov  chain 

In  this  subsection  we  discuss  two  different  approaches  to  the  identification 
of  the  rate  function  L.  The  first  approach  suggests  a  parameterized  family 
of  change  of  measures  (see  Remark  2.1)  that  will  be  used  later  on  to  build 
importance  sampling  schemes.  The  second  approach  identifies  the  rate  func¬ 
tion  L  in  terms  of  relative  entropy,  and  will  be  used  in  the  analysis  of  the 
asymptotic  optimality  of  adaptive  schemes. 

The  first  approach  is  based  on  a  generalized  Perron-Frobenius  theorem. 
Fix  any  a  G  Then  by  [21],  the  non- negative  kernel 

exp  {{a,g(y))}p(x,dy) 

admits  a  unique  real  eigenvalue  exp{Ff(a)}  and  a  unique  (up  to  a  multi¬ 
plicative  constant)  eigenfunction  r{x\  a)  in  the  sense  that,  for  every  x  G  S, 

J  e<a,fl(2/)>r(y;  a)p[x,  dy)  —  eH^a)r{x\  a),  (2.1) 
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and  with  the  following  properties.  H(a )  is  an  analytic,  strictly  convex  func¬ 
tion  of  a  £  Kd  with  H{ 0)  =  0,  and  there  exist  0  <  ca  <  Ca  <  oo  such 
that 

ca  <  r(x;  a)  <  Ca,  VrGiS.  (2-2) 

The  paper  [21]  also  shows  that  the  rate  function  of  the  large  deviation  prin¬ 
ciple  for  {Sn/n}  is  the  convex  conjugate  of  H,  i.e., 

L((3)  =  sup  [( a,/3 }  —  H(a )]  .  (2-3) 

a£Rd 

Note  that  in  the  special  case  when  the  Markov  chain  Y  is  an  iid  sequence, 
H(a)  is  the  logarithm  moment  generating  function  of  g{Yj)  and  r{x\  a)  =  1. 
Therefore,  this  result  generalizes  the  classical  Cramer’s  Theorem,  at  least  for 
bounded  iid  random  variables.  For  the  case  when  Y  is  an  irreducible  Markov 
chain  with  finite  state  space,  exp{if(a)}  is  just  the  maximal  eigenvalue  of 
the  irreducible  non-negative  matrix  exp{(a,  g(y))}p(x,  dy),  and  ?’(•;  a)  is  the 
associated  right  eigenvector. 

Remark  2.1  It  is  not  difficult  to  see  that,  thanks  to  (2.1),  for  each  a  €  Rd, 

exp  {(a,g(y))  -  H(a)}  ■  T<f'  a|  -p(x,dy) 

r\x\  a) 

defines  a  probability  transition  kernel. 

Another  approach  is  the  weak  convergence  methodology  which  utilizes 
a  stochastic  control  representation  for  certain  exponential  integrals  [14].  It 
first  identifies  the  large  deviations  rate  function  for  the  empirical  measure 
of  the  Markov  chain  in  the  r-topology,  then  uses  contraction  principle  to 
obtain  the  rate  function  for  {Sn/n}.  We  will  need  the  following  definitions. 

For  an  arbitrary  Polish  space  Z,  we  denote  by  V(Z)  the  collection  of  all 
probability  measures  on  space  (. Z ,  B(Z)).  For  a  pair  of  probability  measures 
7,  g  G  V{Z),  the  relative  entropy  of  7  with  respect  to  g  is  defined  as 

if  7  7C  g  and  R(-y\\g)  =  00  otherwise.  Given  a  probability  transition  kernel 
q(x,  dy)  on  space  Z ,  we  define  gq  €  V(Z),  g<S>  q  G  V(Z  x  Z)  by 

gq{B)  =  jz  q(x,  B )  g(dx) 

(g®q)(DxB)  =  /  g(dx)q(x,dy)  =  /  q(x,B)g(dx) 

JDxB  Jd 
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for  all  Borel  sets  D,  B  C  Z.  The  collection  of  all  probability  transition 
kernels  on  Z  is  denoted  by  T(Z). 

The  weak  convergence  approach  identifies  the  rate  function  for  {Sn/n} 
in  terms  of  relative  entropy: 


L(/3)  =  inf  j-R(//<g>  q\\n®p)  ■  n  €  V(S),q  G  T(S),/j,q  =  n,  J^gdp  =  /?j  . 

(2-4) 

The  validity  of  the  representation  (2.4)  is  implied  by  the  results  in  [14,  Chap¬ 
ters  8  &  9],  where  the  large  deviation  principle  of  the  empirical  measures 
associated  with  Markov  chains  are  studied  under  weaker  assumptions. 

For  future  reference,  we  summarize  the  preceding  discussion  into  the 
following  proposition.  The  only  part  that  has  not  been  mentioned  is  the 
super  linearity  of  the  rate  function  L,  which  is  an  easy  consequence  of  (2.3) 
and  the  finiteness  of  H  [14,  Lemma  6.2.3(c)]. 

Proposition  2.1  Under  Condition  2.1,  the  sequence  {Sn/n}  satisfies  the 
large  deviation  principle  with  rate  function  L,  which  is  given  by  equations 
(2.3)  and  (2.f).  Moreover,  the  rate  function  L  is  convex,  lower-semicontinuous, 
and  superlinear  in  the  sense  that 


lim  inf 

N^oo  {/3eRd:||/3||>iV} 


m 


=  oo. 


In  particidar,  L  has  compact  level  sets. 


2.3  Asymptotic  optimality 

In  this  subsection  we  define  asymptotic  optimality  for  an  importance  sam¬ 
pling  scheme. 

Consider  a  probability  space  (fit,  IF,  P)  and  a  family  of  events  {An}  with 
lim  —  log  P{An}  =  -7, 

n— >oo  77, 

for  some  7  >  0.  A  general  formulation  of  importance  sampling  for  this 
problem  can  be  described  as  follows.  In  order  to  estimate  P{An},  a  generic 
random  variable  Zn  is  constructed  such  that  P{An}  =  EZn.  Independent 
replications  (Z®,  Z), ... ,  Z^1)  of  Zn  are  then  generated,  and  we  obtain  an 
estimator  by  averaging: 


Qn  = 


K 
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The  estimator  is  unbiased,  i.e.,  EQ%  =  P{An}.  The  rate  of  convergence 
associated  with  this  estimator  is  determined  by  the  variance  of  the  sum¬ 
mands,  or  equivalently,  their  second  moment  E[(Zn )2].  The  smaller  the 
second  moment,  the  faster  the  convergence,  whence  the  smaller  sample  size 
K  required.  However,  it  follows  from  Jensen’s  inequality  that 

lirnsup  —  —  \ogE\(Zn)2\  <  lim  ——log  (EZn)2  =  2q. 
n— >oo  n  n— >°°  n 

The  estimator  is  said  to  be  asymptotically  optimal  if 

lim  --log E\{Znf)  =  2q. 

n— >00  77, 

Remark  2.2  Since  the  performance  of  the  estimator  is  completely  de¬ 
termined  by  the  second  moment  of  its  generic,  iid  building  block  Z%,  we  will 
drop  the  superscript  k  hereafter.  Note  that  n  does  not  stand  for  sample  size, 
but  for  the  large  deviation  parameter. 


3  Statement  of  the  main  result 


The  adaptive  importance  sampling  scheme  we  consider  dynamically  selects 
the  change  of  measure  (or  the  parameter  a)  in  the  form  suggested  by  Remark 
2.1,  according  to  the  sample  history.  Naturally,  the  scheme  is  closely  related 
to  a  control  problem.  Let  the  control  an  =  {«”(-,  ■),  j  =  1, . . .  ,n  —  1}  be 
given,  where  each  a™  :  S  x  is  a  Borel-measurable  function.  Then 

the  state  dynamics  are  governed  by 

3- 1 

V  "  »(>?)•  ,/  —  0.  1 - ,  n. 

i= 0 


Here  we  set  Y0  =  Y0  =  y0 ,  and  for  j  >  1,  YJ1  is  conditionally  distributed, 
given  {Tjn,  i  =  0, 1, . . . ,  j  —  1},  according  to 


Vj(dy)  =  exp  {(a$,g(y))  - 


r(y;atj) 


■p(YJLi,dy ) 


with  (abusing  notation  a  bit)  a”  =  a7j(YJl_1,S'j/n). 

An  unbiased  estimator  of  P{Sn/n  €  A}  is  defined  as  the  average  of 
independent  copies  of 


71—1 

-  f{S™/neA}e  J-1V  11 

3= 1 


XL’ l{-{cq,g{Y?))+H(cq))  . 


r(yn.an)  ' 


Our  goal  is  to  minimize  the  second  moment,  hence  the  variance,  of  the 
summands  Xn  by  judiciously  choosing  the  control  an.  Thus  we  consider  the 
value  function  defined  by 


Vn(y0 )  =  mf  E[Xl] 


inf  E 

an 


V 

s%/neA}e 


n—  1 
3 


i1  (-=»<' 


ah9(yj 


n- 1  „2 


0>+2ff(a?))  TT 


li  r2 (Yjl-,aq) 


For  convenience  we  write  Vn(yo )  as  when  no  confusion  is  incurred.  We 
also  consider  the  log  transform 

Wn  =  --logPn. 
n 

We  have  the  following  result,  which  asserts  the  existence  of  asymptotically 
optimal  adaptive  importance  sampling  schemes. 


Theorem  3.1  Under  Conditions  2.1  and  2.2,  we  have 


lim  Wn  =  2  inf  L(B). 

n—>oo  p£A 

The  detailed  proof  is  deferred  to  Appendix  A.  It  is  worth  to  point  out 
that  the  construction  of  asymptotically  optimal  or  nearly  optimal  adaptive 
schemes  (i.e.  selection  of  the  control  an)  is  implied  by  a  dynamic  program¬ 
ming  equation  (DPE)  appearing  in  the  proof.  Since  the  proof  is  rather 
lengthy  and  technical,  it  makes  sense  to  give  an  outline  and  some  intuitive 
discussion  below,  so  that  readers  can  proceed  to  the  construction  of  the 
adaptive  schemes  (Section  4),  without  having  to  delve  into  the  technical 
details  of  the  proof. 

Outline  and  intuition  of  the  proof:  Thanks  to  the  discussion  in  Section 
2.3,  it  suffices  to  show  the  lower  bound 


lim  inf  Wn  >  2  inf  L(3).  (3.1) 

The  proof  will  utilize  the  DPE  that  is  satisfied  by  Wn .  In  order  to  do  so, 
we  first  extend  the  dynamics.  Abusing  notation  a  bit,  for  x  €  y  G  S, 
and  i  G  {0, 1, . . . ,  n},  define  the  dynamics 

j- 1  _ 

Sij  =  nx  +  Y l  Y$,  j  =  i,...,n. 
e=i 
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Here  we  set  Y,t^  =  y ,  and  for  j  >  i  + 1,  Y^-  is  conditionally  distributed,  given 
{y^,  £  =  i, . . .  ,j  —  1},  according  to 

<i(^)  =  exp  {(o$,g(z))  -  H{a ?)}  •  dz), 

where  a™  =  a’j(Y™j_1,S™j/n).  The  original  control  problem  corresponds  to 
x  =  0,  i  =  0,  y  =  yo-  Define  analogously 


Vn(x,y;i ) 

=  inf  E 


1  1,1  (-2(an,(/(Yn.)>+2^(an))  TT  ^ >  aj  ) 

l{5?n/neA}e^=l+l1  1  ^  II  - - - — 


i=*+i 


r2(yn;an) 


and  its  log  transform 

Wn(x,y,i )  =  --logHn(x,y;i). 
n 

The  terminal  conditions  are 


Vn(x,y,n )  =  1a(z),  lTn(x,y;n)  =  oo  •  1  a{x). 


Since  it  is  inconvenient  to  study  a  problem  with  an  oo  terminal  condition,  we 
instead  work  with  a  mollified  version  of  the  control  problem.  Let  F  :  Rd  — >  R 
be  an  arbitrary  bounded  and  Lipschitz  continuous  function.  Suppose  that 
V]}  is  defined  as  Vn ,  save  that  the  indicator  function  lr^„  /neA\  is  replaced 

by  exp{— 2 nF(S™n/n)}.  Similarly  define 

W$(x,y,i)  ± -~\ogVP(x,y,i).  (3.2) 

n 

Since  Vp  is  the  value  function  of  a  control  problem,  one  can  write  down  the 
DPE  for  Vp.  Substituting  (3.2)  in  this  DPE,  one  obtains  an  equation  for 
Wp]  see  (A.l).  The  proof  of  the  desired  inequality  (3.1)  is  based  on  the 
analysis  of  this  recursive  equation  for  Wp. 

The  relative  entropy  representation  for  exponential  integrals  [14,  Propo¬ 
sition  1.4.2]  states  that 


—  log  J  e  f^y(dx) 


inf 

7eP(S) 


R( 7  II/*)  + 


(3.3) 
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for  all  bounded  and  Borel  measurable  functions  /.  Applying  this  represen¬ 
tation  formula  to  the  equation  for  Wp,  one  obtains 


Wp(x,y,i) 


=  sup  inf 

aeRd 7eP(5) 


JWf( x+  ^g(y),z-,  i  +  l")  7 (dz) 

+  ^  (^(t(-)  b(y,  •))  +  J  (a,g(z))7(dz)-H(a) 


+  ~  I  log  ?  a\l{dz) 
n  J  r(y ;  a) 


(3.4) 


This  equation  suggests  that  Wp  is  the  lower  value  of  a  discrete-time  sto¬ 
chastic  game.  One  of  the  two  players  of  the  game  (the  a-player)  selects  the 
parameter  a,  and  is  the  weaker  player.  The  other  player  (the  7-player)  is  the 
stronger  player,  and  selects  the  distribution  7  that  determines  the  evolution 
of  the  state.  The  right  hand  side  of  (3.4)  would  take  a  simpler  form  if  we 
could  permute  the  sup  and  inf.  However,  this  is  not  (in  general)  possible, 
since  the  last  term 


r(z;  a ) 
r(y;a) 


7  (dz) 


(3.5) 


may  not  be  concave  in  a. 

This  difficulty  is  also  the  main  distinction  from  the  setting  of  Cramer’s 
Theorem  where  the  Markov  chain  Y  reduces  to  an  iid  sequence  of  random 
variables.  The  latter  case  gives  r(x;  a)  =  1  and  the  unpleasant  term  (3.5) 
disappears,  whence  min/max  theorem  can  be  applied  to  convert  the  DPE  of 
Wp  into  a  DPE  associated  with  a  control  problem,  which  is  much  simpler 
to  analyze  than  a  game  [16].  However,  the  interchange  of  sup  and  inf  is  not 
possible  with  (3.4)  as  written. 

The  key  idea  to  overcome  this  difficulty  and  to  obtain  a  lower  bound 
for  Wp  is  as  follows.  Fix  an  integer  m,  and  consider  a  variant  of  the  game 
where  the  a-player  is  constrained  to  policies  such  that  a  must  be  constant 
over  time  intervals  of  length  1/m.  This  new  game  is  even  more  favorable 
to  the  7-player,  whence  it  will  have  a  smaller  lower-value.  Letting  n  go  to 
infinity,  the  lower  value  of  the  new  game  converges  to  a  function  Up,  and 
we  expect 


liminf  Wp(x,  y;  I  nk/m\)  >  UP(x;  k),  k  —  0, 1, . . . ,  m. 

n — >00 

A  bonus  of  taking  the  limit  is  that  the  troubling  terms  (3.5),  which  can  be 
interpreted  as  part  of  the  running  cost,  cancel  off,  and  it  is  not  difficult  to 
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guess  that  U™  should  satisfy 


Up‘{x\  k)  =  sup  inf 


UP 


m 


x  H - /?;  k  +  1  j  H - (jL(/3)  +  (o',  /?)  — 


m 


(3.6) 


with  terminal  condition 

Uf{x-,m)=2F{x).  (3.7) 

In  the  proof,  t/™  is  in  fact  defined  recursively  through  equations  (3.6)  and 
(3.7). 

Equation  (3.6)  is  much  easier  to  analyze.  Analogous  to  [16],  one  can 
show  by  a  weak  convergence  argument  that 

liminf  UF(x,  0)  >  2  inf  {Li  (3)  +  Fix  +  /?)},  (3-8) 

m—>oo  ft(z  rod 


which  in  turn  implies 

liminf  WS (x,  y;0)  >2  inf  {Li (3)  +  Fix  +  /?)}. 

n— >°°  /3eKd 


Letting  x  =  0  and  the  mollifier  F  tend  to  oo  •  1a,  one  arrives  at  the  desired 
inequality  (3.1).  ■ 

The  following  result  is  useful  in  the  identification  of  optimal  adaptive 
importance  sampling  scheme  in  Section  4. 

Corollary  3.2  Fix  an  arbitrary  x  G  and  a  bounded  Lipschitz  continuous 
function  F  :  — >  R.  Assume  Condition  2.1,  and  define  Up 1  recursively  by 
(3.6)  with  the  terminal  condition  (3.7).  Then 

lim  U‘p(x ;  [tm\)  =  2 Up(x,t),  V  t  G  [0, 1], 

where 

UF(x,t)=  inf  {(1  —  t)L(fi)  +  F(x  +  (1  —  t)(3)}.  (3.9) 

/SeM^ 

Proof.  We  will  show  the  inequality  for  t  =  0.  The  case  with  general  t  G  [0, 1] 
is  similar  and  thus  omitted. 

Thanks  to  (3.8),  it  suffices  to  prove 

lim  sup  Upix;  0)  <  2UF(x,  0)  =  2  inf  {Li (3)  +  F(x  +  (3)}. 

m— >oo  /3eMd 
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Fix  an  arbitrary  [3  £  Rd.  The  recursive  definition  of  Up  (3.6)  and  equation 
(2.3)  yield 

UP(x;k)  <  sup  Up  (x  +  —  (3\  k  +  l)  +  —  (L(/3)  +  (a,/3)  -  H(a)) 

ama  L  V  m  J  m  J 

=  Up  (x  +  -(3-  k  +  l)+  —L{(3). 

V  m  J  m 

Repeatedly  applying  this  inequality  for  k  =  0, 1, . . . ,  m  —  1,  we  arrive  at 
UP(x-  0)  <  UP [x  +  (3;m)  +  2 L(J3)  =  2 F[x  +  (3)  +  2 L(J3), 
thanks  to  (3.7).  This  completes  the  proof.  ■ 

4  Implementation  issues  and  examples 

4.1  The  limit  control  problem  and  implementation  issues 

Theorem  3.1  establishes  the  existence  of  asymptotically  optimal  adaptive 
sampling  schemes.  However,  it  does  not  explicitly  discuss  the  construction 
of  such  schemes.  On  the  other  hand,  the  proof  of  the  theorem  implies  that 
one  approach  of  construction  would  be  to  solve,  numerically  if  need  be, 
the  DPE  (3.4)  associated  with  Wp  ( Wn  equals  Wp  when  F  =  oo  •  1^). 
However,  this  approach  may  not  only  require  a  lot  of  computation  effort, 
but  the  resulting  adaptive  sampling  control  (i.e. ,  control  an )  will  directly 
depend  on  n.  In  general,  one  would  prefer  schemes  without  this  dependence. 

An  alternative  approach  is  to  consider  the  DPE  associated  with  the  limit 
problem  of  Up  as  m  tends  to  infinity.  To  this  end,  we  rewrite  equation  (3.6) 
as 

0=  sup  inf  AUP  +  —  (L((3)  +  (a,  (3)  —  H(a))  , 

QgKd/3eKd  L  m 

where 

A  up  =  up(x+  1/?;  k  +  lj-  up(x ;  k ). 

Suppose  that  a  t  subscript  denotes  the  partial  derivative  with  respect  to 
t,  and  that  an  x  subscript  denotes  the  vector  of  partials  with  respect  to 
Xi,  i  —  1, . . . ,  d.  Since  Corollary  3.2  (for  F  bounded  and  Lipschitz  continu¬ 
ous)  asserts  that 

lim  Up(x; \tm\ )  =  2Uf(x]  t), 

m — >oo 
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we  have  formally  the  approximation 

A  UP  «  (—(3,  (2  UF)X)  +  -(2  UF)t. 

Substituting  this  back,  we  have 

0  =  sup  inf  [(/?,  (2UF)X)  +  (2UF)t  +  L((3)  +  {a,  (3)  -  H(a)\ 
a£Rd  /3eRd 

=  (2UF)t  +  sup  inf  [L((3)  +  (a  +  (2UF)X,(3)  ~  H(a)]  . 
aeRd  /3eRd 

Representing  the  infimum  in  terms  of  the  Legendre  transform  H  of  L  gives 
0  =  (2 UF)t  +  sup  [-H  (-a  -  (2Uf)x)  -  H(a)\  . 

a£Rd 

The  strict  convexity  of  H  implies  that 

a*(x,t)  =  -(UF)x(x,t),  (4.1) 

and  that 

0  =  (UF)t  -  H  (~(UF)X) .  (4.2) 

The  equation  (4.1)  identifies,  at  least  formally,  an  optimal  feedback  con¬ 
trol  policy.  However,  this  observation  is  not  entirely  satisfactory  for  several 
reasons.  The  first  is  that  even  if  we  have  an  exact  formula  for  UF,  the 
partial  derivatives  may  not  be  defined  for  all  times  and  spatial  points.  A 
second  reason  is  that  UF  does  not  usually  have  an  explicit  solution,  and  so 
in  many  cases,  numerical  approximation  is  required.  In  order  to  obtain  a 
formal  characterization  of  a*  that  is  more  useful,  we  observe  that,  thanks  to 
the  definition  (3.9)  of  UF  and  the  convexity  of  L,  UF  is  the  value  function 
of  the  deterministic  control  problem 

UF(x,t)  —  inf  [  L(<f>(s))ds  +  F(<f>(l))  , 

'P  Ut 

where  the  infimum  is  over  all  absolutely  continuous  <f>  which  satisfy  < f)(t )  =  x. 
It  is  straightforward  to  see  from  this  control  problem  that  an  optimal  control 
at  (x,t)  is  the  minimizer  in  (3.9),  say  (3*(x,t),  thanks  to  the  convexity  of 
L.  Note  that  under  very  mild  conditions  (3*{x,t )  will  always  exist,  e.g.,  if  F 
is  continuous  and  bounded.  The  standard  dynamic  programming  argument 
implies  that  UF  (in  a  weak  sense)  satisfies  the  DPE 

0  =  (UF)t  +  inf  [L{a)  +  (a,  (C^)x)]  =  (UF)t  -  H(-(UF)X), 

aeRd 
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which,  not  surprisingly,  is  just  equation  (4.2).  The  optimal  control  /3*(x,t) 
is,  at  least  formally,  the  minimizer  in  the  DPE,  or,  /3*(x,  t )  and  —(Uf)x(x,  t ) 
are  conjugate.  It  follows  that 

a*(x,t )  is  conjugate  to  the  minimizer  (3*(x,t)  in  (3.9). 

At  points  where  (Up) X{x,  t )  exists  this  definition  gives  a*(x,  t )  =  —(Uf)x(x,  t ) 
At  points  where  (Up)x(x,t)  does  not  exist  there  are  multiple  minimizing 
(3*(x,t),  and  one  should  define  a*(x,t)  through  conjugacy  in  any  Borel  mea¬ 
surable  way. 

Remark  4.1  The  original  (unmollified)  problem  corresponds  to  F  =  oo-l a- 
In  this  case, 

[3*(x,t)  £  argmin{(l  —  t)L{(3)  :  x  +  (1  —  t)/3  £  A},  (4.3) 

and  a*(x,t)  is  its  conjugate.  Roughly  speaking,  this  means  that  the  con¬ 
trolled  random  walk  always  points  to  the  “most  likely”  end  point,  given  that 
the  end  point  is  in  A  and  that  we  are  currently  at  x  with  1  —  t  units  of  time 
to  go.  Note  that  we  can  assume  without  loss  that  A  is  closed.  This  suffices 
to  guarantee  the  existence  of  (3*(x,t)  for  all  x  £  Rd,t  £  [0, 1). 


4.2  A  numerical  example 


Consider  a  simple  finite-state  Markov  chain  Y  with  state  space  S  =  {1,-1}, 
and  probability  transition  matrix 


Q 


p  l  —  p 

1  o 


[Q(i,j)\ 2x2, 


for  some  constant  p  £  (0, 1).  Define  g  :  S  — >  M  by  g(x)  =  x,  and  Sn  = 
g(Y0)  +  g(Yi)  +  ■  ■  •  +  g{Yn- 1)  =  Y0  +  Y\  +  •  •  •  +  Tn-i-  Assume  that  the 
initial  state  is  Y0  =  1.  We  wish  to  estimate  P{Sn/n  £  A}  for  some  Borel 
set  A  £  M. 

Since  Y  is  an  irreducible  finite-state  Markov  chain,  the  eigenvalue  eH ^ 
and  eigenfunction  r(-;  a),  as  defined  in  (2.1)  for  a  £  R,  are  just  the  maximal 
eigenvalue  of  the  kernel 


[eajQ{hj)\ 


pea  (1  —  p)e  a 
ea  0 


and  the  corresponding  eigenvector,  respectively.  Simple  algebra  gives 


H{a)  =  log 


pea  +  i/p2e2a  +  4(1  —  p) 
2 


V  a  £  M, 
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which  is  a  convex  function  with  H( 0)  =  0,  and  an  eigenvector 


r(l:a) 

'  eH(a)  ' 

r(-l-a) 

ea 

Therefore,  for  any  given  a  G  R,  the  corresponding  change  of  measure  is 
represented  by  the  probability  transition  matrix 


Qa 


eaj-H(a) 


r(j]a ) 
r(r,  a) 


Q(i,j ) 


pea  H (1  —  p)e  2H(a) 

1  0 


(4.4) 


In  order  to  compute  the  convex  conjugate  of  H ,  we  first  observe  that  H'{— oo) 
0  and  H'(+ oo)  =  1.  It  follows  immediately  that  L(/3)  =  ooif/3<0or/?>l. 
For  (3  G  (0, 1),  some  algebra  yields 


L((3)  =  sup  [ot(3  —  H(a)\ 

a£K 

/ 3  4(1—  p)/32  1  1  — /3  1  ,  . 

=  21°V(l-W  +  2  gi+g~2l0g(1~P> 


with  the  minimize!' 


*  •  *m\  1  i  4(1  ~P)P2 

a  =a«3)  =  -logp2(1_/j2). 


(4.5) 


and 


£(0)  =  1™L(/3)  =  -9 log(i  -p).  L(i)  =  1i™L(/?)  =  -1osp- 

p4,(j  z  pT  1 


Furthermore,  L((3)  =  0  if  and  only  if  f3  =  H'(0)  =  p/(2  —  p).  Of  course, 
H'( 0)  =  Js  g(y)  v(dy)  where  ^  is  the  stationary  distribution  of  the  Markov 
chain  Y . 

We  are  interested  in  estimating  pn  =  P{Sn/n  G  A}  for  the  Borel  set 
A  =  (— 00,  a]  U  [b,  00),  0  <  a  <  H' (0)  <  b  <  1. 


In  all  the  following  discussion,  we  take  p  =  1/2,  a  =  1/6,  b  =  1/2,  which 
implies 

mf ^L(P)  =  L(b)  <  L(a).  (4-6) 

We  will  compare  the  naive  Monte  Carlo  simulation,  traditional  importance 
sampling,  and  adaptive  importance  sampling  schemes  below. 
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The  naive  Monte  Carlo  simulation  will  simulate  the  Markov  chain  under 
the  original  transition  probability  kernel  Q.  One  can  also  regard  this  as  a 
special  change  of  measure  with  the  corresponding  a  =  0.  In  this  case,  the 
estimate  is  just  the  sample  mean  of  K  iid  replications  of 


Xn  —  1{  Sn/neA}- 


Since  the  second  moment  of  Xn  satisfies 


lim - log  E 

n — >oo  77, 


lim  —  —\ogpn  =  inf  L{(3)  =  L(b )  <  2 L(b), 
n— >oo  n  0&A 


the  naive  Monte  Carlo  sampling  is  not  asymptotically  optimal. 

Thanks  to  (4.6),  the  traditional  importance  sampling  will  take  (3*  =  b, 
and  a*  is  then  defined  by  (4.5).  The  algorithm  will  generate  a  Markov 
chain  Y  with  probability  transition  matrix  Qa*  and  Yq  =  1.  Let  Sn  = 
Yq  +  ■  ■  ■  +  Yn-\.  The  estimate  is  the  sample  mean  of  K  iid  replications  of 


n—  1 


n—  1 


Xn  ~  1  {Sn/neA}  II  e 


-a*Yj+H(a*)  _  TT 


j= 1 


=i  r(Yj;a*) 


Sn/n£A}e 


-a*Sn+nH(a*)  .  ca*YQ-H(a *)  r(Y0ia*) 

r(F„_  i;a*) 


Since  ?’(•;«*)  is  clearly  bounded  from  above  and  bounded  away  from  zero, 
it  is  not  difficult  to  see  that 


lim - log  E  (Xn)2  =  lim - log  Li 

n— >oo  fi  J  n^>  oo  77, 


1  .  -2  n(a*Srl/n-H(a*)) 

1{S„/neA}e 


Simple  computation  yields  that  { Sn/n }  satisfies  the  large  deviation  principle 
with  rate  function  L((3)  =  L(/3 )  +  H{a*)  —  a* (3.  Now  one  can  apply  the 
Varadhan’s  Theorem  [14,  Theorem  1.3.4]  (with  slight  modification)  to  show 


lim - log  E 

n — »oo  77, 


2a*/3-2H(a*)  +  L(f3) 


inf 

/3eA  L 

jn^  [a* (3  —  H(a*)  +  L{f3)]  . 


In  the  configuration  of  this  example,  the  infimum  in  the  right-hand-side  is 
achieved  at  (3  =  a,  and 


lim - log  E 

n — xx)  77, 


aa*  —  H(a*)  +  L(a)  <  2 L(b). 
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Therefore,  the  traditional  importance  sampling  scheme  is  not  asymptotically 
optimal. 

In  Section  3,  we  argue  the  existence  of  asymptotically  optimal  adaptive 
importance  sampling  schemes  in  general.  The  construction  of  such  adaptive 
schemes  involves  the  selection  of  a  nearly  optimal  control  oP  =  {«”(•,•)  : 
j  =  0, 1, . . . ,  n  —  1}.  It  was  formally  suggested  in  Section  4.1  that  a  good 
choice  is  to  sample  YJ1,  conditional  on  { Y J1,  i  =  0, . . . ,  j  —  1},  according  to  the 
transition  probability  matrix  Qa  as  in  (4.4)  with  a  being  the  conjugate  of 

f3*(x,  t )  given  in  (4.3)  where  x  =  S™/n  =  (Yq M - and  t  =  l—j/n. 

In  case  the  conjugate  of  (3*(x,t )  is  oo  or  — oo,  a  is  taken  as  a  large  positive 
or  negative  number;  see  Remark  4.3  for  more  details.  The  estimate  is  the 
sample  mean  of  K  iid  replications  of 


X  -1  -  pY:]-i(-K’9(Yp))+H(a 

Xn  ~  1  {S%/neA}e  3  '  J 


n— 1 

'').n 

3= 1 


rjYjL^oq) 

r(YJL\a^) 


The  numerical  results  show  that  the  controls  constructed  in  this  way  have 
asymptotically  optimal  performance  (Table  6). 

The  numerical  results  are  reported  below  for  n  =  60.  The  theoretical 
value  of  pn  is 


pn  =  P{Sn/n  <  a}  +  P{Sn/n  >b}  =  0.83%  +  2.44%  =  3.27%. 

See  Remark  4.2  for  the  computation  of  this  theoretical  value.  For  each 
tableau,  we  run  four  simulations  each  with  sample  size  K  =  10000. 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  fin  (%) 

3.11 

3.20 

3.23 

3.09 

Standard  Error  (%) 

0.17 

0.18 

0.18 

0.17 

95%  Confidence  Interval  (%) 

[2.76,  3.46] 

[2.85,  3.55] 

[2.88,3.58] 

[2.74,3.44] 

Table  1.  Naive  Monte  Carlo  Scheme 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  fin  (%) 

2.41 

2.48 

2.44 

16.71 

Standard  Error  (%) 

0.04 

0.04 

0.04 

14.22 

95%  Confidence  Interval  (%) 

[2.34,  2.48] 

[2.41,  2.56] 

[2.37,2.51] 

[-11.73,45.15] 

Table  2.  Traditional  Importance  Sampling  Scheme 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  fin  (%) 

3.17 

3.21 

3.35 

3.33 

Standard  Error  (%) 

0.15 

0.13 

0.17 

0.18 

95%  Confidence  Interval  (%) 

[2.86,  3.47] 

[2.85,  3.47] 

[3.00,3.69] 

[2.96,3.70] 

Table  3.  Adaptive  Importance  Sampling  Scheme 
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An  interesting  observation  is  that  the  traditional  importance  sampling 
scheme  exhibits  seemingly  bizarre  and  inconsistent  simulation  results  (Table 
2).  The  explanation  is  as  follows.  Under  the  alternative  sampling  distrib¬ 
ution  Qa*,  most  of  the  sample  means  Sn/n  will  end  up  near  the  point  b. 
However,  a  few  samples  ( “rogue”  trajectories)  have  means  that  fall  into  the 
interval  (—00,  o].  Even  though  the  “rogue”  trajectories  are  rare,  the  Radon- 
Nikodym  derivatives  associated  with  them  are  so  large  that  they  dominate 
the  variance.  In  simulation  No.  4,  the  presence  of  a  single  “rogue”  trajec¬ 
tory  greatly  raises  the  standard  error  associated  with  the  estimate.  Indeed, 
the  proportion  of  the  contribution  to  the  second  moment  from  this  single 
“rogue”  trajectories  is  more  than  99%.  In  simulations  No.  1,  No.  2,  and  No. 
3,  however,  there  are  no  “rogue”  trajectories,  and  the  standard  error  associ¬ 
ated  with  the  estimate  is  deceptively  small.  The  reason  is  that  the  standard 
error  is  itself  estimated  from  the  sample.  Without  “rogue”  trajectories,  we 
actually  underestimate  the  standard  error.  Therefore,  we  cannot  put  much 
confidence  in  the  standard  errors  thus  obtained,  or  in  the  “tight”  confidence 
intervals  that  follow.  Note  that  the  confidence  intervals  from  these  three 
simulations  do  not  contain  the  true  value.  Similar  phenomenon  also  occurs 
in  the  setting  of  Cramer’s  Theorem,  that  is,  where  the  Markov  chain  Y 
reduces  to  a  sequence  of  iid  random  variables;  see  [18,  16]. 

In  contrast,  the  adaptive  importance  sampling,  on  the  other  hand,  yields 
more  accurate  estimates  and  its  performance  is  much  more  stable.  Even 
though  it  does  not  show  great  advantage  over  naive  Monte  Carlo  simulation 
for  n  =  60,  it  quickly  does  so  when  n  gets  larger.  The  numerical  results  for 
different  n  (with  K  =  10000  fixed  as  before)  are  reported  in  Table  4-6. 

The  naive  Monte  Carlo  does  not  work  well  for  bigger  n.  For  n  =  120 
and  n  =  180,  it  yields  estimates  with  large  standard  errors,  and  for  n  =  240, 
the  simulation  yields  an  estimate  0,  i.e.,  no  sample  mean  reaches  the  target 
set  A.  As  for  the  traditional  importance  sampling,  each  simulation  gives  a 
very  “tight”  confidence  interval,  due  to  the  absence  of  “rogue”  trajectories. 
However,  as  discussed  before,  we  cannot  put  much  belief  into  these  estimates. 
Indeed,  none  of  these  confidence  intervals  cover  the  true  value  of  pn. 

On  the  other  hand,  the  adaptive  importance  scheme  yields  much  more 
accurate  estimates.  In  Table  6,  the  variable  Vn  denotes  the  sample  estimate 
of  the  second  moment  E  [(Xn)2] .  Observe  that  as  n  gets  larger,  the  ratio 

-£log  E[(Xn)2}  _  -  log  E[(xn)2]  _  —  log  Vn 
"log  Pn  -log  Pn  ~  -log  Pn 

approaches  2.  In  other  words,  the  adaptive  importance  sampling  scheme  is 
approaching  optimality. 
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n  =  120 

n  =  180 

n  =  240 

Theoretical  pn 

1.61  x  10_;3 

9.66  x  10_t> 

6.35  x  10"B 

Estimate  pn 

1.80  x  lO"13 

20.00  x  10~3 

0 

Standard  Error 

0.42  x  10”a 

14.14  x  10'3 

NA 

95%  Confidence  Interval 

[0.95,2.65]  x  10-a 

[-8.28,48.28]  x  10~° 

NA 

Table  4.  Naive  Monte  Carlo  Simulation 


n  =  120 

n  =  180 

n  =  240 

Theoretical  pn 

1.61  x  10'3 

9.66  x  10"3 

6.35  x  10“b 

Estimate  pn 

Ooxlr3 

8.76  x  lO"3 

6.01  x  10_b 

Standard  Error 

0.02  x  10'3 

0.18  x  lO"3 

0.13  x  lO"3 

95%  Confidence  Interval 

[1.35,1.45]  x  lO"3 

[8.41,9.12]  x  10~3 

[5.74,6.28]  x  10~° 

Table  5.  Traditional  Importance  Sampling  Scheme 


n  =  120 

n  =  180 

n  =  240 

Theoretical  pn 

1.61  x  10'a 

9.66  x  10-3 

6.35  x  10"° 

Estimate  pn 

1.56  x  10'3 

9.73  x  10"3 

6.29  x  10~° 

Standard  Error 

0.04  x  10-3 

0.15  x  10“3 

0.07  x  10~° 

95%  Confidence  Interval 

[1.49, 1.63]  x  10”a 

[9.44,10.02]  x  10-B 

[6.15,6.43]  x  10-° 

(-log  Vn)/{-\ogpn) 

1.72 

1.87 

1.93 

Table  6.  Adaptive  Importance  Sampling  Scheme:  Asymptotic  Optimality 

Remark  4.2  The  theoretical  value  of  pn  can  be  computed  as  follows.  Let 
Xn  be  the  number  of  —  l’s  in  a  trajectory,  i.e.  Xn  =  o  l{yj=-i}-  Since 
Yo  =  1  and  Q(— 1,1)  =  1,  we  have  0  <  Xn  <  n/2  with  probability  one. 
Clearly  Sn  =  n  —  2Xn,  whence  it  suffices  to  compute  P{Xn  >  m)  for 
all  non-negative  integers  m  such  that  2m  <  n.  But  if  we  define  T±  = 
inf  {j  >  0  :  Yj  ==  —!},  then  Tj  >  1  and  Ti  —  1  is  geometrically  distributed 
with  parameter  (1  —  p).  Moreover,  =  —  1,  and  Yi+Tt  =  1-  Now  recur¬ 
sively  define  for  i  >  2, 


Ti  =  M{j  >  1+Ti_i  :  Yj  =  -1}. 

Then  {Tj  —  1,  Ti  —  Tj  —  2,  T3  —  T2  —  2, . . .}  is  clearly  a  sequence  of  iid  geo¬ 
metrically  distributed  random  variables  with  parameter  (1  —  p) ,  and 

P(Xn  >m)  =  P(Tm  <  n )  =  P(Tm  —  2m  +  1  <  n  —  2m  +  1). 


But 


Tm.  —  2  m  +  1  —  (Tj  —  1)  +  (T2  —  Tj  —  2)  +  ■  ■  •  +  (Tm  —  Tm_  1  —  2), 
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is  the  sum  of  iid  geometrically  distributed  random  variables,  whence  has  a 
negative  binomial  distribution  with  parameter  m  and  (1  —  p).  Standard  soft¬ 
wares  such  as  Splus  contain  the  cumulative  distribution  functions  of  negative 
binomial  distributions,  and  can  easily  yield  the  desired  probabilities. 

Remark  4.3  If  f3*(x,t)  >  1,  then  its  conjugate  is  a*(x,t)  =  +oo,  in  the 
sense  that 

L(P*(x,  t))  =  sup  [a/3*(x,  t)  —  H(a)]  =  lim  [a/3*(x,t)  —  H(a)]  . 

q,  OL — >  +  00 

The  change  of  measure  corresponding  to  a*(x,t)  =  +oo  (at  least  formally) 
is 

Q+oo  =  lim  Qa  = 

Qi — >-)-00 

Similarly,  if  / 3*(x,t )  <  0,  then  its  conjugate  is  a*(x,t)  =  — oo,  with  the 
corresponding  change  of  measure 

Q— oo  —  lim  Qa  — 
a-^—oo 

However,  neither  of  these  two  probability  transition  kernels  is  suitable  for 
the  purpose  of  importance  sampling,  since  the  probability  measure  induced 
by  the  original  probability  transition  kernel  Q  is  not  absolutely  continuous 
with  respect  to  the  probability  measure  induced  by  Q+oo  or  Q_oo- 

To  overcome  this  difficulty,  we  just  take  a  to  be  a  large  positive  or  neg¬ 
ative  number  whenever  a*(x,t)  =  +oo  or  a*(x,t )  =  — oo.  In  our  numerical 
simulation,  a  is  taken  to  be  5  if  a*(x,t)  =  +oo  and  —5  if  a*(x,t )  =  — oo. 
The  probability  transition  kernels  corresponding  to  a  =  ±5  are 

„  _  [  0.9999  0.0001  1  _  [  0.0047  0.9953  " 

Q+5  ~  [  1  0  J  ’  Q~5  ~  [  1  0  J  ’ 

which  are  very  close  to  Q±oo- 

Remark  4.4  In  the  high  dimensional  case  or  when  we  consider  function¬ 
als  more  general  than  probabilities,  there  may  be  no  clear-cut  definition  of 
“rogue”  trajectories.  This  is  because  the  non-optimal  behavior  (from  the 
point  of  view  of  variance  reduction)  of  the  traditional  importance  sampling 
can  creep  into  the  problems  in  more  subtle  and  insidious  ways.  See  [16] 
for  a  two  dimensional  example  where  Y  is  a  sequence  of  iid  normal  random 
variables. 
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Appendix  A:  Proof  of  Theorem  3.1 


Proof  of  Theorem  3.1.  Proposition  2.1  and  Condition  2.2  imply  that 
lim  —log P{Sn/n  £A}  =  -  inf  L(/3). 

n^oo  n  (3eA 

Thanks  to  the  discussion  in  Section  2.3,  it  suffices  to  show  the  lower  bound 
(3.1),  or 

lim  inf  Wn  >  2  inf  L(B). 

To  this  end,  we  extend  the  dynamics  as  in  Section  3,  and  consider  a  mollified 
version  of  the  original  control  problem.  In  other  words,  let  F  :  — >  R  be 
an  arbitrary  bounded  and  Lipschitz  continuous  function,  and  define  Vp,  Wft 
correspondingly;  see  the  discussion  from  equation  (3.1)  to  equation  (3.2). 

Since  Vp  is  the  value  function  of  a  control  problem,  it  satisfies  the  Bell¬ 
man  equation  [4] 


Vp(x,y,i) 

=  inf  [ 

aeRd  Js 


0-2<a,g(z))+2.ff(a)  . 


r2(y;a)„r„  (  1 


r2(z\  a) 


VP  [x  +  -g(z),z-,i  +  1 


n 


e(a,g(z))-H(a)  ■  r^’a)  P(y,dz) 


r{y;a ) 


inf  /  e-{^9{z))+H(a)  .  rJ^lVn  fx+  l  (  )  i  +  A  (  dz) 

amdJs  r(z;a)  \  n  ) 


together  with  terminal  condition  Vp(x,y;n )  =  exp{— 2nF(x)}.  It  follows 
from  (3.2)  that 


WjKx,y\i)  (A.l) 

=  --log  inf  [  e-«a^(*)>+H(a)  .  ?j3^e-nWF(x+nBM™+1)p(y,dz) 
n  aeRdJs  r{z;a) 


and  that  Wp(x,y;n )  =  2F(x). 

The  discussion  in  Section  3  now  prompts  the  following  definition.  Fixing 
an  arbitrary  m  G  N,  for  0  <  A’  <  m  —  1  define  recursively, 


Up(x;  k )  =  sup  inf 


m 


Up  (  x  H - / 3 ;  k  +  1  j  H - (T(/3)  +  (ct,  /?)  —  -ff(cr)) 


m 


given  the  terminal  condition 

UP(x-m)  =  2F{x),  V  x  G  Md. 


(A.2) 

(A.3) 


22 


See  Section  3  for  the  interpretation  of  Wp  and  Up  as  lower  values  of  games. 
The  key  observation  is  the  following  lemma,  whose  proof  is  deferred  to  Ap¬ 
pendix  C. 

Lemma  A.l  For  an  arbitrary  sequence  xn  — >  x  G  we  have 

lirninf  inf  Wp(xn ,  y,  [nk/m\ )  >  U™(x ;  k),  k  =  0, 1, . . . ,  m. 

n  >oo  y^S 


Assume  Lemma  A.l  holds  for  the  moment.  All  that  remains  to  show  is  the 
inequality 

lirninf  U^(x;  0)  >  2  inf  {Li  (3)  +  Fix  +  (3)}.  (A.4) 

m— >oo  /3eRd 

Indeed,  suppose  (A.4)  is  true.  Fix  an  arbitrary  j  6  N,  and  define  Fj  (y)  = 
j(d(y,A )  A  1),  which  is  bounded  and  Lipschitz  continuous.  Since  1  a(v)  < 
exp{— 2 nFj(y)},  we  have 

lim  inf  Wn  >  lim  inf  Wp.  (0,  yy,  0) 

>  lim  inf  Up'.  (0;  0) 

>  2  inf  [L(0)  +  F0)\. 

/3eRd 

Exactly  as  on  [14,  pages  10-11],  a  compactness  argument  shows  that 
lim  inf  {L((3)  +  Fj(/3)}  =  inf  L(/3), 

]— >0O/3gRd  /3£A 

and  we  complete  the  proof. 

Now  we  show  inequality  (A.4).  The  idea  is  to  represent  Up  as  the  value 
function  of  a  control  problem  with  the  help  of  the  min/max  theorem.  To 
this  end,  define 


C  = 


9  g  V(Rd )  : 


J  L(/3)  9(d/3)  <  ooj 


and  rewrite  (A. 2)  as 


Uf{x\  k ) 


sup  inf  [  Up'  (x  +  — (3 ;  k  +  1^  0{d(3) 

QgRd  seC  L J  \  m  J 

+  ^  (/ L(0)e(d/3)  +  (a,J (39m) 


We  make  the  following  useful  observation,  whose  proof  is  deferred  to  Ap¬ 
pendix  C. 
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Lemma  A. 2  Up{-\  k)  is  bounded  and  Lipschitz  continuous  for  every  k.  In¬ 
deed 

\\Up{x]k)\\  <  211^1100,  Vxeld,  k  =  0,1, . . .  ,m, 

and  Up{-\  k)  is  Lipschitz  continuous  with  Lipschitz  constant  2 Lp,  where  Lp 
is  the  Lipschitz  constant  for  the  mollifier  F. 

The  next  lemma  is  a  version  of  min/max  theorem,  whose  proof  is  almost 
identical  to  [16,  Lemma  2.2]  and  whence  omitted. 

Lemma  A. 3  For  any  bounded  and  lower  semi- continuous  function  f  :  — » 

M,  we  have 


sup  inf 

agRd  0eC 


=  inf  sup 

0eCaeRd  L 


UP(x;k )  =  inf  sup 
0eCaeRd 


inf 

e&c 


J  f{0)  d9  +  j  L{0)  de  +  (a,  J  0d6^-  H{a) 

f{0)  dO  +  J  L(0 )  d6  +  ^a,  J  (3  dO^j  —  H(a ) 

Thanks  to  Lemma  A. 2  and  Lemma  A. 3,  we  obtain 

J  UP  (.r;  +  k  +  l)  9{d0) 

+  ^  L{0)  0{d(3)  +  (a,  J  Pd{d0)^  -  H(a) 

f  UP  (x  +  -fck  +  l)  9{dp) 

\  m  J 

+  ^~(J  HP)  9{d0)  +  L^J  09 {d0) 


(A.5) 


This  last  display  implies  that  Up  has  an  interpretation  as  the  minimal  cost  of 
a  stochastic  control  problem.  To  simplify  the  notation,  we  state  the  control 
problem  only  for  the  case  k  =  0.  The  control  problem  will  be  defined 
on  a  probability  (Q,F,  P),  and  Ex  will  denote  that  the  initial  condition 
of  the  state  process  is  x.  An  admissible  control  is  a  sequence  {V”,  j  = 
0, 1, . . . ,  ?77.  —  1},  with  each  i/f  be  a  stochastic  kernel  on  given  Rrf.  Given 
an  admissible  control  sequence,  the  state  dynamics  are  defined  by  S™  =  rrix 
and 


om  ■  Qiri  ,  \/ra 
DJ+1  —  Oj  -f-  I  j  , 


where 


p  {vp  £dy\YP,0<i<  j}  =P{rpe  dy  \  S?/m}  =  i${dy  \  Sf/m). 
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We  then  define  the  value  function 


m—  1  i 

‘  r  /  r  \ ' 

E  - 

—  m 

_j=0 

j  L(y)v™{dy)  +  L^J  y^(dy)j 

+  2F(SZ/m) 

where  the  infimum  is  taken  over  all  controls  {v™}  and  resulting  controlled 
processes  { S™/m }  that  start  at  x  at  time  0.  Since  v™  also  satisfies  the  DPE 
(A. 5)  [4,  Chapter  8]  and  terminal  condition  v'p(x-,m)  =  U™(rr;m)  =  2 F(x), 
we  obtain  by  induction  that  f7™(x;fc)  =  v™(x\k)  for  all  x  G  and  k  G 
{0, . . .  ,m}. 

Define  a  stochastic  kernel  um  on  given  [0, 1]  by 


vm(dy\t) 


vf(dy)  if  t  G  \j/m,  (, j  +  1  )/m),  j  =  0, 1, . . .  ,m  -  2 

um-i(dy)  if  1  €=  ](m  -  1  )/m,  1] 


Let  A  denote  Lebesgue  measure.  Then  the  definition  of  vm(dy\t)  and  the 
convexity  of  L  imply  that 


Up(x;0) 


=  inf  Ex 


>  inf  Ex 


=  inf  Ex 
T'"1} 


t  f  L(y)S"(dy\t)dt+  £  -l(!  wTm)  +mszim) 
Jo  J Rd  m  \J-Rd  J  ) 

f  [  L{y)vm(dy\t)  dt  +  L  (  £  -  /'  yuf(dy)\  +2 F(S%/m) 
Jo  J Rd  \~km  J Rd  / 


\3= 0 


/Rdx  [0,1] 


L(y)um(dy  x  dt)  +  L 


/Rdx  [0,1] 


yvm(dy  x  dt) 


+  2  F($Z/m) 


where  x  dt)  =  um\dy\t)dt.  A  straightforward  weak  convergence  ap¬ 

proach  will  be  adopted  to  derive  the  desired  inequality  (A. 4).  Since  the 
proof  is  essentially  the  same  as  [14,  Theorem  5.3.5],  we  only  give  a  sketch. 

For  each  e  >  0,  there  exist  a  sequence  of  controls  {z/m,  m  G  N}  such  that, 
for  every  m,  we  have 


Vtf(x-,0)+e>Ex 


2  F(SZ/m) 
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Furthermore,  since  L  is  non-negative  and  F  is  bounded,  we  have 
sup  Ex  /  L(y)vm(dy  x  dt)  <  oo 

meN  v/Rdx  [0,1] 

However,  since  function  L  is  superlinear  (Proposition  2.1),  it  is  not  difficult 
to  check  that  {um}  is  uniformly  integrable  in  the  sense  that 


lim  sup  Ex  [  \\y\\vm(dy  x  dt)  =  0. 

C-°°meN  J{y:\\y\\>C}x[0,l] 


It  follows  from  that  proof  of  [14,  Proposition  5.3.2]  that  {Vm}  is  indeed 
tight.  Therefore  we  can  extract  a  weakly  convergent  sub-subsequence,  still 
denoted  by  {^m},  such  that  =4>  v  for  some  stochastic  kernel  u  whose 
second  marginal  is  Lebesgue  measure  [14,  Lemma  5.3.4],  We  utilize  the 
Skorokhod  representation  [6],  which  allows  us  to  assume  (when  calculating 
the  limits  of  the  integrals)  that  the  convergence  is  actually  w.p.l.  It  follows 
from  the  uniform  integrability  of  {um}  and  the  proof  of  [14,  Proposition 
5.3.5]  that 


/  yum(dy  x  dt)  — >  /  yv(dy  x  dt) 

v/Rdx  [0,1]  J Rdxf0,ll 


Rdx  [0,1] 


and 

S™/m  Z  =  x  +  [  yy[dy  x  dt). 

v/Rdx  [0,1] 


Furthermore,  it  follows  from  the  lower  semicontinuity  and  non-negativity  of 
L  [14,  Lemma  A. 3. 12]  that,  with  probability  one, 


lim  inf  [  L(y)vm(dy  x  dt)  >  [  L{y)v(dy  x  dt). 

m  ./Rdx[0,l]  v/Rdx[0,l] 

Thanks  to  convexity  of  L  and  Jensen’s  inequality,  we  have 


/Rdx  [0,1] 


L(y)v(dy  x  dt)  >  L 


/Rdx  [0,1] 


yv{dy  x  dt) 


By  Fatou’s  Lemma  and  the  lower-semicontinuity  of  L  [25],  we  have 


lim  inf  U™(x]  0)  +  e  >  Ex 


2  L 


/Rdx[0,l] 


yv{dy  x  d.t)  ]  +  2 F(Z) 


It  is  now  trivial  that  the  right  hand  side  of  the  last  inequality  is  bounded 
below  by 

2  inf  [L(/3)  +  F(x  +  /3)\  . 


Since  £  >  0  is  arbitrary,  (A. 4)  follows  readily,  which  completes  the  proof. 
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Appendix  B:  A  large  deviation  upper  bound 


In  this  section,  we  study  a  uniform  large  deviation  principle  upper  bound, 
which  is  essential  for  proving  the  key  Lemma  A.l.  We  present  a  proof  based 
on  the  weak  convergence  approach  [14].  Alternatively,  one  can  adapt  the 
methodology  in  [13]. 

The  following  two  lemmas  will  be  useful. 

Lemma  B.l  Suppose  S  is  a  Polish  space  and  V(S)  is  the  space  of  proba¬ 
bility  measures  on  S  endowed  with  the  weak  convergence  topology.  Consider 
a  sequence  of  random  variables  pn  :  (On ,  En ,  Pn)  — *  V(S).  In  other  words, 
{pi1}  is  a  sequence  of  random  probability  measures.  Then  {gn}  is  tight  if 
and  only  if  the  sequence  { Enpn }  is  tight.  Here  Enpn  e  V(S)  is  defined  by 

{Enpn)(A)  =  f  pn(uj){A)  Pn(du) 

for  every  Borel  set  A  inVfS). 

Proof.  See  [22,  Theorem  6.1,  Chapter  1],  ■ 

Lemma  B.2  Suppose  S  is  a  Polish  space,  {pn}  C  V(S),  and  p{-,  •)  a  prob¬ 
ability  transition  kernel.  If  pn  — *  p  in  the  r-topology  for  some  p  €  V(S), 
then 

pn  ®p  — >  p®p 

in  the  t -topology.  Here  p®p  denotes  the  probability  measure  on  S  xS  given 

(»®P)(B)  =  f 

for  every  Borel  set  BC.Sy.S- 

Proof.  It  suffices  to  show  that 

/  f{x,y)pn{dx)p{x,dy)  -*■  /  f{x,y)p{dx)p(x,dy) 

JSxS  JSxS 

for  every  bounded,  measurable  function  /.  Since  pn  — >  p  in  the  r-topology, 
it  remains  to  show  that 

J  f(x,y)p(x,dy) 

is  a  bounded  and  measurable  function  (over  x ).  The  boundedness  is  trivial, 
and  the  measurability  follows  from  Fubini’s  Theorem;  c.f.  [5,  Exercise  18.20]. 
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Proposition  B.3  Suppose  Y  =  {Yj,j  G  N}  is  a  Markov  chain  that  takes 
values  in  a  Polish  space  S.  Let  p  denote  the  probability  transition  kernel 
of  Y ,  and  assume  Condition  2.1  holds.  Suppose  g  :  S  — >  is  a  bounded 
measurable  function,  and  define  H  and  L  as  in  (2.1)  through  (2.3).  Then 
for  any  fixed  a  G  Rd,  bounded  and  continuous  function  f  :  >  M,  and 

sequence  xn  — >  x  G  we  have 


lim  inf  inf - log  Ev 

n—>oo  y£S  n 


s«)> -n/(*n+™ £"=° 9(Yj)) 


where 


If(x)  =  inf  [f(x  +  P)  +  L(/3)  +  (a,  (3)} 


Proof.  Let 


y)  =  — log  Ey 

n 


-'(’tSw t »«>) 


=  -ilog 

n  J  y 

Here  tt ™  is  the  joint  distribution  of  (Yq,  Yi,  •  •  • ,  Yn- 1),  or 

-JTy(dy0,  dyi,---,  dyn )  =  6y(dy0)p(y0,  dyi)p(yi,  dy2 )  •  ■  ■  p(yn-i,dyn). 

Clearly  vn  is  bounded,  thanks  to  the  boundedness  of  g  and  /.  It  suffices  to 
show  that  for  every  sequence  xn  — >  a:  and  {yn}  C  5, 


lim  inf  vn{xn,yn)  >  If(x). 


(B.l) 


For  an  arbitrary  e  >  0,  the  relative  entropy  representation  of  exponential 
integrals  (3.3)  [14,  Proposition  1.4.2]  yields  the  existence  of  a  probability 
measure  pn  on  5ra+1  such  that 


vn(xn,yn)+e 


(B.2) 


1 

>  - 
n 


R(Tn  IK-)  +  (a,J  +/  f(xn  +  l  X>(y,)j  dPn- 


In  particular,  it  is  not  hard  to  see  that 


1 


sup  —R(pn  ||7r”n)  <  oo. 
nG  N  ^ 


(B.3) 
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We  can  factor  yP  as 

Vn(dy0,  dyi,  •  ■  • ,  dyn )  =  Mo  (%))Mi  (dj/ibo)  ■  ■  ■  yZ(dyn\yn-i,yn-2,  •  •  • ,  Vo)- 

Now  consider  a  probability  space  P),  on  which  we  define  a  stochastic 

process  given  by 

P(Y0nedy)  =  fjfiidyo) 

P(Y?+1€dy\Y?,i  =  0,l,<..,j)  =  ft".  \  (dy\Yj" ,  i.  -  0.1....,./) 
for  j  =  0, 1, . . ,  n  —  1.  To  ease  exposition,  let 


^R(yn\\^n)  =  ^E  W||V)  +  E^(^+i(-)ll«v))  •  (B.4) 

L  3=0 

However, 

n— 1 

i— o 

1  n— 1 

=  -  X)  Rtfyfidy)  x  jSy+1(d«)||5y„(dj/)  x  p(Yp,dz)) 
n  j= o  J  J 

^  n— 1 

=  -  X  R(f>Yr(dy)  X  M"+i(dz)||6yn(dy)  ®P{y,dz)) 

3=0  J  J 

(  1  "_1  1  n_1  \ 
t>yn(dy)  x  /i”+1(d^r)  -  X  ^y~(dy)  <8>  p(y,dz) 

\n  j= 0  J  ni=0  3  / 

=  i2(7"||L"®p), 
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where  the  inequality  follows  from  the  convexity  of  relative  entropy  i?(-||-). 
Thanks  to  (B.2),  and  observing  that 


/ 


/ 


77,-1 


+  -  E »(» 


j= o 


dpn 


E  J  gd~Ln 
Ef  (xn  +  J  gdV^j 


we  arrive  at 
vn(xn,yn)  +  e>E 


R{1n\\Ln®p)  + 


+  /(*"  + 


It  suffices  to  show  {7 n }  is  tight.  Indeed,  if  this  is  true,  the  same  argument  as 
in  [14,  Theorem  8.2.8]  allows  us  to  extract  a  weak  convergent  subsequence 
of  (7 n,Ln),  still  indexed  by  n,  such  that 

(7",Tn)^(7,i) 


for  some  stochastic  kernel  7  on  S  x  S  and  some  stochastic  kernel  on  S,  and 
a  (random)  transition  probability  function  q  such  that 

7 (dy  x  dz )  =  L(dy)  <g>  q(y,  dz) 


and 

Lq  =  L  (B.5) 

hold  almost  surely.  In  particular,  we  have 

(7n)2  =4>  (7)2  =  Lq  =  L 

Note  equation  (B.5)  says  that  L  is  indeed  the  invariant  measure  for  the 
transition  probability  function  q.  Also  observe  that 

sup  ER(-fn\\Ln  ®  p)  <  00. 

nSN 

This  implies  the  existence  of  a  subsequence,  still  indexed  by  n,  such  that 

~Ln  -  L,  (7n)2  L 

in  the  r-topology;  see  the  proof  of  [14,  Lemma  9.3.3].  Therefore, 

J  g  dLn  — >  J  gdl 
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almost  surely.  Furthermore,  thanks  to  Lemma  B.2,  Ln  <S>  p  — >  L  <g>  p  in  the 
r-topology  (hence  in  the  weak-topology)  almost  surely.  The  lower  semi¬ 
continuity  of  i?(- 1|  ■)  implies 

liminf  R('yn\\Ln  ®p)  >  R(^\\L  <g  p). 

It  follows  readily  from  Fatou’s  Lemma  that 


liminf  vn(xn,  yn)  +  e 

n—>oo 

>  E 

=  E  I  R(L  <g>  q\\L  ®p)  +  (a,  I  g  cLL)  +  f  [x  +  I  g  cLL 


R{l\\L  ®p)  +  (a, j  g  dL j  +f[x+  f  gdL 


>  inf 


R(t*  ®  q\\fi®  p)  +  (a,  J  g  dy^j  +/  (x  +  J  g  dp 

Recalling  equation  (2.4)  and  letting  e  — >  0,  we  obtain 


liminf  vn(xn,  yn )  >  inf [L((3)  +  (a,  (3)  +  f{x  +  /?)]. 

which  is  the  desired  inequality  (B.l). 

It  remains  to  show  the  tightness  of  {y71}.  All  we  need  is  the  tightness 
of  the  two  marginals,  {(yn)i}  and  {( 7n)2 }•  However,  it  is  not  difficult  to 
observe  that 


H(yn)i  =  ELn 


71—1 


-  V  E6y 
n  —  ■ 


3=0 


1 

n 


71— 1 


E  /‘“’t 

J-0 


where  pn'3  denotes  the  j-th  marginal  of  the  probability  yn,  and  similarly 

^  n— 1  ^  7i—l 

77”)2  =  -E^‘”+.  =  -Em”j+1. 

3=0  3= 0 

Letting  ||  •  ||„  denote  the  total  variation  metric,  we  have 

||H(7n)i  -  E( ^)2\\v  =  -||/rn’°  -  pn'n\\v  <  -•  (B.6) 

n  n 

If  we  can  show  {(7n)2}  is  tight,  then  Lemma  B.l  implies  {E( 7n)2}  is  tight, 
which  in  turns  yields  the  tightness  of  {E,(yn)i},  thanks  to  (B.6).  Applying 
Lemma  B.l  once  again,  we  have  the  tightness  of  {(7n)i}-  Therefore,  it  is 
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sufficient  to  show  that  { ("7^)2}  is  tight.  The  proof  will  distinguish  two  cases: 
mo  =  1  and  mo  >  1. 

Suppose  that  mo  =  1.  Note  that  the  non- negativity  of  relative  entropy, 
(B.3),  and  (B.4)  imply 


It  follows  from  the  assumption  of  uniform  recurrency 
avp{-)  <p{y,-)  <bvp{-),  V  y  €  S, 

that 

R(^+i(-)\\p(yrr))>cR(fi]+1\Wp) 

for  some  constant  c  >  0.  It  is  now  easy  to  derive  from  the  convexity  of 
relative  entropy  that 

^  n—  1 

sup  ER(('yn)2\\vp)  <  sup  E-  V  R(jij+1  \\vp)  <  00, 

neN  neN  j= q 

which  further  implies  the  tightness  of  {( 7n)2}  since  R(-\\isp)  is  a  tightness 
function  on  V(S).  Note  that 

n—  1 

E(ln)  2  =  -E^'+1 

nU 

is  also  tight,  thanks  to  Lemma  B.l. 

The  general  case  with  mo  >  1  is  slightly  more  complicated.  We  will  give 
a  proof  with  mo  =  2,  and  observe  that  the  proof  for  mo  >  2  is  essentially  the 
same  and  thus  omitted.  Without  loss  of  generality,  we  show  {(7n)2  :  n  even} 
to  be  tight.  The  tightness  for  {(7n)2  :  n  odd}  is  similar. 

To  ease  notation,  let  nn  =  7r”n ,  and  7rn,e  be  the  marginal  distribution  of 
7 rn  over  even  coordinates;  i.e. 

nn,e(dy0,  dy2,  •  •  • ,  dyn- 2,  dyn)  =  Sy{dy0)p(2) {yo,  dy2 )  ■  •  •  p(2\yn-2,  dyn). 
One  can  similarly  define  yn,e,  or 

Pn,e{dy0,  dy2,...,  dyn )  =  (dy0)  y2’e  {dy2\y0)  ■  ■  ■  y%e(dyn\yn-2,  ...,y2,yo). 

Thanks  to  the  chain  rule  and  non-negativity  of  the  relative  entropy,  we  have 

R{yn’e\\irn’e)  <  R{pn\\irn), 
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and  thus  supn  ^R(/j,n,e\\Trn,e) 
m-o  =  1,  we  have  that 


<  oo.  With  the  same  proof  as  for  the  case 


2 

n 


]T  /in’e’J+1 
3=0 


is  tight;  here  ^n,ej  is  the  j-th  marginal  of  yn,e]  i.e. 


Hn'e'j{dy2j)  =  yn’e(S, dy2j,S, . . . , 5). 


One  can  similarly  define  yn'°  as  the  marginal  distribution  of  pn  over  odd  co¬ 
ordinates,  and  the  same  argument  can  be  carried  over  to  prove  the  tightness 
of 


-  V 

n  f-r! 


3+ 1 


3=0 


However,  observe  that 


Hn’e’i  =  =  /in’2^+1. 


We  have 


--1 
1  /  2  2 


--1 
2  2 


n—  1 


+  =«Ea*bJ+i  =  ^(7" 

2  n  “  n  r-r!  2  f-f 


i=o 


j=o 


j=o 


This  implies  the  tightness  of  {.E( 'yn)2  ■  n  even},  which  is  equivalent  to  the 
tightness  of  {(7n)2  :  n  even},  thanks  to  Lemma  B.l.  ■ 


Appendix  C:  Proof  of  Lemma  A.l  and  Lemma  A. 2 


Proof  of  Lemma  A.l.  The  proof  is  by  induction.  For  k  =  m,  we  have 
[' nk/m\  =  n.  By  definition, 


lim  inf  inf  Wp(xn,  y\  n) 

n—> OO  y£S 


lim  inf  inf  2 Fix11) 
n— >oo  y£S 


lim  inf  2 Fix11), 
n— >  oo 


and  Lemma  A.l  follows  trivially  from  the  continuity  of  F. 

Assume  now  the  claim  holds  for  k+ 1.  Let  £(n)  —  \n{k+l) / m\  —  |_nA;/mJ . 
Also,  let  TTJy  be  the  probability  measure  on  5:,+1  defined  by 


T lidyo ,  dyi,---,  dyj)  =  8y(dy0)p(yo ,  dyi)p(yi,dy2)  ■  ■  -p(yj-i,  dyj) 
for  every  y  G  S  and  every  j  £  N. 
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For  an  arbritrary  a  G  Rd,  let 


U™F(x;fc)=  inf 


m 


1 


Up'  x  H - /?;  k  +  1  H - (L(/3)  +  (a,  /3)  —  H(a)) 


m 


It  follows  from  the  definition  that  Up,{x\k )  =  supa  U™F(x;  k).  Therefore, 
all  we  need  to  show  is  that,  for  every  a  G  and  any  sequence  xn  — >  x, 

liminf  inf  WF(xn,y,  [ nk/m\ )  >  U™F(x;k). 

n — »oo  yGtS 

However,  for  an  arbitrary  fixed  a  G  Md,  the  dynamic  programming  prin¬ 
ciple  implies  that 


£(n) 


urn?  \  1  /  \  1  i  f  —{a,We(n^  g(yi))+£(n)H(a)  tt  r{Vj— li^O 

M'pfx,  u;  I  rafc/rn  )  > - log  /  e  ^='  •  — r - r~ 

n  J  r  {yFa) 

^-nWp(x+±  ^^o_l9(w)’^(™);Ln(fc+1)/m)j)  ^(n) 

__  —lino-  [  9(yi))+t(n)H(a)  r{lj0\ a) 

n  gJ  r(ye{ny,a) 

^-nWy(x+±  Xl^o)_1  9(%)>^(™)hn(fe+1)/m)J^  ^(n) 

Since  g  is  bounded  and  ?’(•;«)  is  both  bounded  from  above  and  bounded 
away  from  zero  by  (2.2),  it  suffices  to  show 


liminf  inf  vF(xn,  y:  0)  >  U™F(x]  k), 

n^o o  yeS  "  /  —  <x,r  \  i  n 


(C.l) 


where 


^(rr,2/;0)  = - log  /  e" 

n 


,-(a’I^jio  9(yi))+#(n)H(a) 

-nW£^x+±  g(yj),y£(„);ln(k+l)/m]'j  ^£(n) 

We  claim  that  inequality  (C.l)  is  a  direct  consequence  of 

liminf  inf  vF(xn,  y,  0)  >  U™F(x\k),  (C.2) 

where 


•  e 


vnF{x,y-  0)  =  --log  f  e-{a^=o -1 9{yi))+t{nma) 
n  J 


~nUF  (x+k  E'L”o)_1 9(w);fc+i)  d^(n) _ 
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Indeed,  since  t(n)  <  n,  one  can  always  find  a  compact  set  K  C  Rd  such  that 
^  ^(n)— i 

9(Vj)  v  (yo,  2/i,  •  •  • ,  %(«)),  V  n  €  N, 

n  j= o 

thanks  to  the  boundedness  of  <7  and  the  assumption  xn  — >  x.  It  is  also 
not  hard  to  show  by  contradiction  from  the  induction  hypothesis  and  the 
continuity  of  Up  (Lemma  A. 2)  that,  for  any  e  >  0,  there  exists  N(e)  €  N 
such  that  for  all  x  G  K  and  n  >  N(e), 

inf  Wp(x,  y;  [ n(k  +  l)/mj)  —  Up(x;  k  +  1)  >  —s. 


We  arrive  at 


liminf  inf  Vp(xn,  y\  0)  >  liminf  inf  Vpix11,  y;  0)  —  e 

n—>00  y£S  n— >00  y£S 


for  every  e  >  0.  It  follows  that  (C.l)  is  implied  by  (C.2). 

It  remains  to  show  (C.2),  which  is  an  easy  consequence  of  the  uniform 
large  deviation  bound  Proposition  B.3,  Lemma  A. 2,  boundedness  of  g,  and 
that 

IM _ L  0 

n  m 

This  completes  the  proof.  ■ 

Proof  of  Lemma  A. 2.  That  UpL(--,k )  is  Lipschitz  continuous  with  Lip- 
schitz  constant  2 Lp  follows  trivially  by  induction,  and  the  terminal  condition 
(A.3). 

As  for  the  boundedness  of  UpL(-]k)1  we  first  show  it  is  bounded  from 
below.  Since  H( 0)  =  0  and  L  is  non-negative,  definition  (A. 2)  gives 


Up'(x\  k)  >  inf 


Up1  (  x  H - /3;  k  +  1 )  -\ - L(/3) 


TO 


TO 


/SeK^ 


>  inf  (  x  H - /3;  k  +  1 


TO 


=  inf  Uf  (z-  k  +  1) 


for  every  x.  It  follows  that,  for  every  k, 

inf  U"p(x;  k)  >  inf  Up  (x;  k  +  1)  >  ■  ■  ■  >  inf  Up  (x;  to)  >  —  2HPHOQ. 

x£Rd  x£Rd  x£Rd 
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It  remains  to  show  that  Up  is  bounded  from  above.  Let  (3  be  a  subdifferential 
of  the  convex  function  H  at  a  =  0.  Then 

L(j3 )  =  sup  [(«,  j3)  —  H(a)\  =  0 

aSRd 


and  the  supremum  is  achieved  at  a  =  0.  By  definition  (A. 2)  again,  we  have 


UP(x]  k )  <  sup 

oeMd  L 


1 


m 


1 


Up1  x  H - /?;  k  +  1  I  H - ((cr,  (3)  —  H[a)) 


m 


—  U'p  f  x  - /3;  k  +  l^j 

\  m  J 

<  sup  (z;  k  +  1) 
zeKd 


for  every  x.  It  follows  that,  for  every  k, 

sup  UP(x\  k )  <  sup  UP  (x;  k  +  1)  <  ■  ■  ■  <  sup  UP  ( x ;  m)  <  2||^1||0O. 

a;SMd  x£Rd  x£Rd 

This  completes  the  proof.  ■ 
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